feat: generate GeoJSON from PostgreSQL instead of re-reading CSV #404
ThibaudDauce wants to merge 13 commits into main
Conversation
Force-pushed from 663d2a3 to 3999e41
Stream GeoJSON features directly from the database using a cursor, avoiding a third read of the CSV file. PostgreSQL builds the JSON with json_build_object, so no Python-level casting is needed. Also extract _detect_geo_columns to deduplicate the geo column detection logic, and add detailed timer marks for the geojson and pmtiles conversion steps.
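For context, a minimal sketch of the pattern described above, assuming asyncpg and a hypothetical table `"my_table"` with `"lat"`/`"lon"`/`"city"` columns; in the PR the real query is built per-resource by `_build_feature_sql`:

```python
# Minimal sketch, assuming asyncpg and a hypothetical table layout;
# the PR builds the real query per-resource in _build_feature_sql.
from pathlib import Path

import asyncpg


async def stream_geojson(conn: asyncpg.Connection, output_path: Path) -> None:
    # PostgreSQL assembles each feature with json_build_object, so rows
    # arrive as ready-to-write JSON strings: no Python-level casting.
    query = """
        SELECT json_build_object(
            'type', 'Feature',
            'geometry', json_build_object(
                'type', 'Point',
                'coordinates', json_build_array("lon", "lat")
            ),
            'properties', json_build_object($1::text, "city")
        )::text AS feature_json
        FROM "my_table"
        WHERE "lat" IS NOT NULL AND "lon" IS NOT NULL
    """
    with output_path.open("w") as f:
        f.write('{"type": "FeatureCollection", "features": [\n')
        first = True
        async with conn.transaction():  # asyncpg cursors require a transaction
            async for row in conn.cursor(query, "city"):
                if not first:
                    f.write(",\n")
                f.write(row["feature_json"])
                first = False
        f.write("\n]}\n")
```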
Force-pushed from 3999e41 to 3181c12
Still a bug when there are a lot of columns, will push a fix next week!

I might wait for that PR to be finished to be able to base this new PR on it, to separate the changes.
# Conflicts:
#	tests/test_analysis/test_parquet_export.py
Pierlou left a comment:
Thanks for the refactor 🙏 The flow looks good to me (though I haven't gone through the technicalities of the SQL part), but I have a doubt regarding the type of the properties we export, and a couple of remarks.
```python
return query, params


async def csv_to_geojson_from_db(
```
We may want to keep the function name close to its parquet counterpart
Suggested change:
```diff
-async def csv_to_geojson_from_db(
+async def save_as_geojson_from_db(
```
... wouldn't db_to_geojson and db_to_parquet be even more explicit, and more consistent with names like csv_to_geojson?
```
# ending up here means we either have the exact lat,lon format, or NaN
# skipping row if NaN
if row[geo["latlon"]] is None:
elif "latlon" in geo or "lonlat" in geo:
```
Keeping the insight
| elif "latlon" in geo or "lonlat" in geo: | |
| elif "latlon" in geo or "lonlat" in geo: | |
| # skipping row if geo data is None |
```python
# latlon/lonlat columns can contain values like "[48.8566, 2.3522]" or "48.8566 , 2.3522"
# Both versions below strip spaces and brackets, then split on comma.
```
The insight is already given within the function, this feels misplaced here
Suggested change:
```diff
-# latlon/lonlat columns can contain values like "[48.8566, 2.3522]" or "48.8566 , 2.3522"
-# Both versions below strip spaces and brackets, then split on comma.
```
```python
for col in property_cols:
    params.append(col)
    placeholder = f"${len(params)}::text"
    properties_fragments.append(f"{placeholder}, {_quote_ident(_db_col_name(col))}")
```
NIT
Suggested change:
```diff
-for col in property_cols:
-    params.append(col)
-    placeholder = f"${len(params)}::text"
-    properties_fragments.append(f"{placeholder}, {_quote_ident(_db_col_name(col))}")
+for idx, col in enumerate(property_cols):
+    params.append(col)
+    properties_fragments.append(f"${idx + 1}::text, {_quote_ident(_db_col_name(col))}")
```
```python
properties_fragments = []
for col in property_cols:
    params.append(col)
    placeholder = f"${len(params)}::text"
```
Does that mean we're exporting all values as text? In which case it is not what we want (columns that contain ints/floats should be exported as such)
No, here we're talking about the column names (the JSON keys); I've added a comment in df12267
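To illustrate (a sketch with a hypothetical float column `"price"`): the `::text` cast applies only to the bound key parameter, while the value slot references the column itself, so PostgreSQL keeps the column's native type in the output.

```python
# Hypothetical illustration: the bound parameter is the JSON *key*,
# cast to text; the value is the column reference, untouched.
query = 'SELECT json_build_object($1::text, "price")::text FROM "my_table"'
# Executed with the parameter "price", each row comes back as e.g.:
#   {"price" : 12.5}   <- a JSON number, not the string "12.5"
```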
```python
first = True
async for row in cursor:
    if not first:
        f.write(",\n")
    f.write(row[0])
    first = False
```
I don't think I understand the first trick: why don't we f.write(f"{row[0]},\n") for all rows? (in which case we should finally f.write("]"))
It will have a trailing comma in the JSON?
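A small self-contained sketch of why the flag is needed: JSON forbids trailing commas, so the separator has to go between features, which the `first` flag achieves by prefixing a comma to every row but the first.

```python
rows = ['{"a": 1}', '{"b": 2}']

# Writing f"{row},\n" for every row would end with:
#   {"a": 1},
#   {"b": 2},
#   ]          <- invalid JSON: trailing comma before the bracket
#
# Prefixing the comma to every row except the first stays valid:
parts = []
first = True
for row in rows:
    if not first:
        parts.append(",\n")
    parts.append(row)
    first = False
print("".join(parts))  # {"a": 1},\n{"b": 2} -- no trailing comma
```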
bolinocroustibat left a comment:
Thank you. A few remarks, otherwise LGTM.
```python
with output_file_path.open("w") as f:
    f.write('{"type": "FeatureCollection", "features": [\n')
    first = True
    async for row in cursor:
```
What happens in case of an error while parsing the rows?
Should we delete output_file_path in a try/finally if we detect failure after partial write, or wrap the stream loop in try/finally to remove the output file on any error after open?
The caller csv_to_geojson_and_pmtiles in csv.py:193-194 already wraps the call in a try/except that calls remove_remainders(resource_id, ["geojson", "pmtiles", "pmtiles-journal"]). So if an error occurs during streaming, the partial file is cleaned up by the caller.
This is the same pattern as the CSV path — csv_to_geojson doesn't handle cleanup on error either, it's delegated to the caller.
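A sketch of that caller-side pattern, with the names cited above stubbed out so the control flow is self-contained:

```python
# Stubs standing in for the real implementations in the codebase.
async def csv_to_geojson_from_db(resource_id: str) -> None: ...
def remove_remainders(resource_id: str, kinds: list[str]) -> None: ...


async def csv_to_geojson_and_pmtiles(resource_id: str) -> None:
    try:
        await csv_to_geojson_from_db(resource_id)  # may fail mid-stream
    except Exception:
        # a partial .geojson may exist on disk: clean up, then re-raise
        remove_remainders(resource_id, ["geojson", "pmtiles", "pmtiles-journal"])
        raise
```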
```python
    upload_to_minio: bool = True,
) -> tuple[int, str | None] | None:
    """Generate a GeoJSON file by streaming features directly from PostgreSQL."""
    geo = _detect_geo_columns(inspection)
```
NIT: a type hint would be nice here for readability:
```python
geo: dict | None = _detect_geo_columns(inspection)
```
_detect_geo_columns already has a return type? Your IDE should show you the type hint on hover, no?
My IDE doesn't :) I add type hints when they're not obvious at first sight.
```python
    output_file_path: Path,
    upload_to_minio: bool = True,
) -> tuple[int, str | None] | None:
    """Generate a GeoJSON file by streaming features directly from PostgreSQL."""
```
NIT: csv_to_geojson_from_db has a one-line docstring while csv_to_geojson documents args, behavior (skipped rows), and return values. Aligning the DB variant (even briefly: inputs, streaming, parity goal with the CSV path) might make it easier to maintain and choose between the two in case we have to.
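For instance, a hedged sketch of what an aligned docstring could look like (signature pieced together from the diff hunks; the return-tuple description is an assumption):

```python
from pathlib import Path


async def csv_to_geojson_from_db(
    inspection: dict,
    table_name: str,
    output_file_path: Path,
    upload_to_minio: bool = True,
) -> tuple[int, str | None] | None:
    """Generate a GeoJSON file by streaming features directly from PostgreSQL.

    The feature JSON is built in the database (json_build_object) and
    streamed through a cursor, so the CSV file is never re-read. Rows whose
    geo columns are NULL are skipped, mirroring csv_to_geojson.

    Returns:
        The same tuple as csv_to_geojson (assumed: file size and MinIO URL),
        or None when no geo columns are detected in the inspection.
    """
```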
```python
async for row in cursor:
    if not first:
        f.write(",\n")
    f.write(row[0])
```
NIT: Might be a good opportunity to add a column alias on the outer SELECT in _build_feature_sql, to avoid using row[0], preferring something like row["feature_json"]; that would massively improve readability imho.
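Something like this (a sketch; the actual outer SELECT lives in _build_feature_sql):

```python
# Sketch: alias the built JSON on the outer SELECT...
query = """
    SELECT json_build_object('type', 'Feature', ...)::text AS feature_json
    FROM source_table
"""
# ...so the write loop can use key access (asyncpg Records support it):
#     f.write(row["feature_json"])   # instead of the positional row[0]
```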
| template["features"] = streamable_list(get_features(file_path, inspection, geo)) | ||
|
|
||
| with output_file_path.open("w") as f: | ||
| json.dump(template, f, indent=4, ensure_ascii=False, default=str) |
GeoJSON files are often large; pretty-printing can add a meaningful amount of redundant bytes on disk and for uploads, even if compression narrows the gap.
Also, indent=4 produces a pretty-printed JSON while the database path streams a compact document, with no re-indentation. Someone diffing two exports or grepping a file might think something broke when they see the compact stream.
Should we treat compact as the normal export format for both paths, dropping indent=4 on the CSV path, and assume that if a human needs to read the file, they will format it themselves anyway?
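A sketch of the compact variant being suggested (file name hypothetical):

```python
import json

template = {"type": "FeatureCollection", "features": []}
with open("out.geojson", "w") as f:
    # drop indent=4 and tighten separators for a byte-dense export
    json.dump(template, f, ensure_ascii=False, separators=(",", ":"), default=str)
```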
Yes, but this path is only for csv_to_geojson (the line only moved). I think we can leave it since it will be removed soon?
```diff
 # Convert to GeoJSON — from DB if available and enabled, otherwise from CSV file
 if config.DB_TO_GEOJSON and table_name:
-    result = await csv_to_geojson_from_db(
+    result = await save_as_geojson_from_db(
```
I think we should use this opportunity to rename:
- save_as_geojson_from_db → db_to_geojson
- save_as_parquet_from_db → db_to_parquet

Those names match the source_to_target style we already use (csv_to_parquet, csv_to_db, geojson_to_pmtiles, parquet_to_db), and they line up nicely with our config flags.
Optionally, to match, we could also rename save_as_parquet to rows_to_parquet.
Stream GeoJSON features directly from the database using a cursor, avoiding a third read of the CSV file. PostgreSQL builds the JSON with json_build_object, so no Python-level casting is needed.
Also extract _detect_geo_columns to deduplicate the geo column detection logic.
This will allow sending this part to another job.
On MN_07_latest-2025-2026.csv: 50s → 18s (-64%)
On joconde.csv: 75s → 43s (-42%)

Same as #402 but for GeoJSON.